Generic HPC Install Script #329
Conversation
Heavily inspired by the original `batch/slurm_init.sh` script. The init script is a run-once script that takes care of installing dependencies and setup, whereas prerun sets the env vars needed per run.
Initial version of the HPC install script, somewhat inspired by the slurm init script.
- Changed how the R arrow version is formatted for readability.
- Changed the final output command to print diagnostic info correctly.
Added slurm's --partition flag to the `batch/inference_job_launcher.py` script for usage on UNC's Longleaf cluster.
The longleaf-specific init/pre-run scripts are now superseded by the generic `build/hpc_install.sh` script.
Remove the `--partition` flag (for the slurm partition to use) from the inference job launcher script. This will be handled in a new flepiscripts script.
Looks generally good, but a few questions to address.
build/hpc_install.sh
Outdated
```shell
elif [[ $1 == "rockfish" ]]; then
    # Setup general purpose user variables needed for RockFish
    USERDIR="/scratch4/struelo1/flepimop-code/$USER/"
```
need to cd to USERDIR as well here?
and if we do, several of the $USERDIRs below can/must be eliminated
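The suggestion above is the common pattern of `cd`-ing into the base directory once so later paths can be relative; a minimal sketch, using a local stand-in directory since the real scratch path only exists on RockFish:

```shell
# Sketch of the "cd to USERDIR" suggestion; the path is a local stand-in
# for /scratch4/struelo1/flepimop-code/$USER/ so the sketch runs anywhere.
USERDIR="$PWD/scratch-demo"
mkdir -p "$USERDIR"
cd "$USERDIR" || exit 1
# Subsequent steps can now use relative paths instead of "$USERDIR/...":
mkdir -p flepiMoP
```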
could add creating some hpc-wide environmental variables to the longleaf-setup repo. does that make sense to pair with this?
lastly ... bit weird that we're doing install here in scratch. why not in $HOME? i get doing projects on scratch.
per in-person conversation:
- need to check the preferred location for libraries on longleaf & rockfish
- maybe refer to that as $LIBDIR (or ACCLIBDIR or some such)
- might want to move that as a generic variable to be set on the HPC, and if so - move that to the longleaf-setup directions (which could itself stand to be scriptified) and make that setup a prerequisite to this? (one downside to that would be other people on other HPCs wanting to use / modify this script - future problem?)
At least on the longleaf side it looks like `/users` is similar to `$HOME`; the documentation states "Think of it as a capacity expansion to your home directory." However, I think maybe the project directory should be moved to `/work` since that's high throughput and designed for active jobs. So my take is:
- `flepiMoP` and `flepimop-env` stay in `/users`, especially for the conda env since that directory can get large and `$HOME` has some low and strict storage caps.
- Move the project directory to `/work` since that'll actually need throughput for the job.
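The split proposed above could be expressed as a handful of variables in the install script. A sketch with illustrative names (the two roots are local stand-ins for `/users/...` and `/work/...` so the sketch runs outside Longleaf):

```shell
# Illustrative sketch of the proposed Longleaf layout; the *_ROOT values
# stand in for real /users and /work paths so this runs anywhere.
USERS_ROOT="$PWD/users-demo"            # low-churn storage with strict quotas
WORK_ROOT="$PWD/work-demo"              # high-throughput storage for active jobs
FLEPI_PATH="$USERS_ROOT/flepiMoP"       # the clone stays alongside the user dirs
FLEPI_CONDA="$USERS_ROOT/flepimop-env"  # conda env: large but mostly static
PROJECT_PATH="$WORK_ROOT/flu-project"   # project output needs job throughput
mkdir -p "$FLEPI_PATH" "$FLEPI_CONDA" "$PROJECT_PATH"
```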
I still need to dig up the rockfish documentation. Longleaf docs: https://help.rc.unc.edu/getting-started-on-longleaf/#main-directory-spaces.
Thanks for these scripts - the install all worked great for me on longleaf. 🎉
I'm also open to having these things anywhere, but as @jcblemai said I think having everything (including the flepimop libraries) in `/work` or `/scratch` makes the most sense, including the `flepiMoP` folder itself. I understand how installing these in `/users` or `/home` would be ideal if flepiMoP was stable, but from a practical perspective I am operationally often changing things within flepimop and reinstalling things run-to-run, playing with my own different environments, jumping between branches, or jumping between different `FLEPI_PATH`s (not ideal, but practically this is just what we've had to do with concurrent runs and changes). So for convenience it would be good to just have everything in the same place, imo.
Separately, from my experience with running stuff in the past I was confused by having to link the specific location of the `flepimop-env`. I'm fine either way, I just don't think I follow why the change.
@saraloo is there a general class of the things you're changing?
> I would put everything in /scratch/ on rockfish (as per the current doc)

This has been the case since 9ca12ed. I see, this was not the case for the `$USERDIR` variable; done now as well.

> and in /work/ on longleaf, for both convenience and speed.

This is done now. I think I am misinterpreting the docs (see https://help.rc.unc.edu/getting-started-on-longleaf/#main-directory-spaces) on the differences between `/users` and `/work`. @jcblemai what are the practical differences between the two? My interpretation was that `/work` was meant for high-IO, short-term storage for active work whereas `/users` is designed for longer-term, lower-IO (read okay?) storage for libraries/codebases.
> I am operationally often changing things within flepimop and reinstalling things run-to-run, playing with my own different environments, jumping between branches, or jumping between different FLEPI_PATH

@saraloo is this normal operational behavior? This sounds like the installation script needs to be much more accommodating to flexibility if this is the case. For the different environments, do you mean switching between multiple conda envs? What makes each of these envs distinct? As far as jumping branches, this script won't do anything to your `flepiMoP` clone, although you can switch the branch yourself and then run this script again to update the conda env with the code from that branch ("install" is a misnomer, it really should be "install or update"; I'll change the script name and make sure this is clear when writing the documentation). Does that accommodate this use case? As for different `$FLEPI_PATH`s, this script checks if this env var is set before doing anything, and if it is, just uses the set value, so there should be no issue setting custom `$FLEPI_PATH`s. Have you tested this yet and does it accommodate your use case?
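The "use `$FLEPI_PATH` if already set" behavior described above presumably amounts to a guard like the following; the fallback value is illustrative, not necessarily what `hpc_install.sh` actually uses:

```shell
# Sketch of respecting a pre-set FLEPI_PATH; the default below is
# illustrative, not the script's actual fallback.
if [[ -z "${FLEPI_PATH:-}" ]]; then
    FLEPI_PATH="$HOME/flepiMoP"
    echo "FLEPI_PATH not set, defaulting to $FLEPI_PATH"
else
    echo "using existing FLEPI_PATH: $FLEPI_PATH"
fi
```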
Responding to both simultaneously. No, I don't think this is normal behavior, so feel free to make a judgement call on your end. Just in the past, during larger periods of development (which inevitably coincide with operational demands) I was running two or three different diseases on significantly different gempyor and/or R inference setups from different conda environments (again, don't think this will necessarily be standard, especially now that more people can run stuff). Just flagging that there will be circumstances where flexibility is preferable; I want to reduce the possibility of someone setting the wrong flepimop version they're working on, or reduce having to jump around to switch branches, etc.
And sorry, haven't tested the FLEPI_PATH bit yet but that makes sense and I don't anticipate any issues there with setting that.
As we move to the new workflow Carl described today, we will have custom branches for runs, so you can envision someone running Flu and RSV from the same account but using two different flepiMoP branches. I however think this flexibility can be added later with the pre-run scripts that are mentioned below.
Sometimes also, when running too many parallel runs, we can have some filesystem locks on the packages, which is always annoying, but I would not worry about it too much.
RE Sara's question: do we need to specify the location of the conda environment?

> @jcblemai what are the practical differences between the two? My interpretation was that /work was meant for high IO short term storage for active work whereas /users is designed for longer term lower IO (read okay?) storage for libraries/codebases.

This is correct, but flepiMoP does not support writing to folders other than the project one, so we work from /work.
That's really great, thank you. Ideally we would have a per-cluster specific configuration file that would populate some variables like:

```yaml
# rockfish.yml
paths:
  final_output_path: /scratch4/struelo1/flepimop-runs/
  project_path: /scratch4/struelo1/flepimop-code/
  secrets_path: $USER/flepi_secrets.sh
init_commands:
  - module purge
  - module load gcc/9.3.0
  - module load git
  - module load git-lfs
  - module load slurm
  - module load anaconda3/2022.05
  - conda activate flepimop-env
```

(perhaps it's better if the above is a bash file)

Bit 3 will also be used by the runner script. We decided to put the different commands in the doc instead of a script so that errors are not silent and are reported gradually. I would make sure the script exits on failure (maybe using `set -e`).
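The "bash file" variant floated above could be a small per-cluster script that the installer sources. A sketch mirroring the YAML; the file name and variable names are hypothetical, and the `module` calls are commented out so the sketch runs off-cluster:

```shell
# Hypothetical rockfish.sh mirroring the per-cluster YAML sketch.
FINAL_OUTPUT_PATH="/scratch4/struelo1/flepimop-runs/"
PROJECT_PATH="/scratch4/struelo1/flepimop-code/"
SECRETS_PATH="$USER/flepi_secrets.sh"
# The init commands would run on a cluster login node; commented out here:
# module purge
# module load gcc/9.3.0 git git-lfs slurm anaconda3/2022.05
# conda activate flepimop-env
```

The installer would then just `source rockfish.sh` (or `longleaf.sh`), keeping cluster differences out of the main script.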
- Changed `flepiMoP` git clone to use ssh instead of http to allow for edits from HPC.
- Add `set -e` to error clearly on a command failure.
- Install `gempyor` from the cloned `flepiMoP` repo directly; yet to do the same for R packages.
Also, because the above comments focus on what's missing: it's really awesome that this runs on longleaf (and would have been very useful right now, except that I'm running emcee).
Clean up error handling so the script exits a bit more nicely.
This has been added now.
The first comment is a bit out of date now; usage has changed some. However, I plan on replacing the current HPC install/update guides in the flepiMoP wiki in a separate PR into `gitbook-documentation`. Conda environment management has changed some from the discussion in #329 (comment) per a Slack discussion with @jcblemai and @saraloo. It will now use a default environment in `~/.conda`.
Not sure where @jcblemai's comment went re branches wanted for different operational runs, but working trees seem like an option: https://stackoverflow.com/questions/2048470/git-working-on-two-branches-simultaneously - might be something to do via an init script?
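The working-tree idea is straightforward to demo: `git worktree` checks a second branch of the same clone out into a sibling directory, so two runs can sit on different branches concurrently. A self-contained sketch (repo and branch names are illustrative):

```shell
# Demo of two branches of one repo checked out side by side via git worktree.
tmp="$(mktemp -d)"
cd "$tmp"
git init -q demo
cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"
git branch flu-run                       # hypothetical per-run branch
git worktree add ../demo-flu flu-run     # second working tree, same clone
# demo/ and demo-flu/ can now be on different branches at the same time.
```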
Switch from using a conda environment specified by a path to a conda environment specified by a name assumed to be in `~/.conda`.
This option is not compatible with the `--editable` flag.
This is currently handled in
After the recent round of edits I was able to submit one of the recent Flu configs to rockfish using an environment set up and initialized using the `hpc_install.sh` script.
Let's aim for "next up" on that - I'd like an integration of these scripts with the CLI to support
Sure, can you create an issue for that to move discussion of the details there?
Required resolving conflicts in `inference`'s `DESCRIPTION` and `install_cli.R`.
The merge-base changed after approval.
Describe your changes.

This adds a generic `hpc_install.sh` script which can reproducibly set up and install `flepiMoP` on both rockfish and longleaf. The script vaguely:
- … the `FLEPI_PATH` environment variable.
- … `flepiMoP`.

Going to add a separate PR for documentation since that needs to be merged into `gitbook-documentation`. But user usage would look something like, on rockfish:

```shell
wget https://raw.githubusercontent.com/HopkinsIDD/flepiMoP/refs/heads/GH-191/longleaf-batch-submission/build/hpc_install.sh
vim /scratch4/struelo1/flepimop-code/ext-twillard/slack_credentials.sh
chmod 600 /scratch4/struelo1/flepimop-code/ext-twillard/slack_credentials.sh
source hpc_install.sh rockfish
```
and on longleaf:

```shell
wget https://raw.githubusercontent.com/HopkinsIDD/flepiMoP/refs/heads/GH-191/longleaf-batch-submission/build/hpc_install.sh
vim /users/t/w/twillard/slack_credentials.sh
chmod 600 /users/t/w/twillard/slack_credentials.sh
source hpc_install.sh longleaf
```
And replacing the URL with the appropriate one. Then keeping your environment up to date is as easy as:

```shell
source /scratch4/struelo1/flepimop-code/ext-twillard/flepiMoP/build/hpc_install.sh rockfish
```

on rockfish or:

```shell
source /users/t/w/twillard/flepiMoP/build/hpc_install.sh longleaf
```

on longleaf.
A big open question is how best to install packages. Right now it is installing `gempyor`, `flepiconfig`, `flepicommon`, and `inference` from GitHub rather than locally, whereas I think installing locally would be preferred for dev reasons, at least in the meantime.

What does your pull request address? Tag relevant issues.
One of many steps required for GH-191. Should resolve GH-308
Tag relevant team members.
@pearsonca, @shauntruelove, @MacdonaldJoshuaCaleb
Edit: Fix `wget` URL to use the "raw" file instead of the pretty version.